PacBio #143

antgonza · 2025-09-01T18:02:34Z

No description provided.

antgonza · 2025-09-02T18:20:55Z

tests/test_Pipeline.py

@@ -207,43 +207,6 @@ def test_get_orig_names_from_sheet_with_replicates(self):

        self.assertEqual(set(obs), exp)

-    def test_required_file_checks(self):


These tests were (1) specific to Illumina, so they should not be part of Pipeline, and (2) a duplication of what normally happens in any code: check if a file exists and raise an error; thus, in my opinion, unnecessary.

AmandaBirmingham

Couple of typos and questions. Happy to look again when tests pass.

AmandaBirmingham · 2025-09-10T16:23:05Z

src/qp_klp/PacBioMetagenomicWorkflow.py

+        self.mandatory_attributes = ['qclient', 'uif_path',
+                                     'lane_number', 'config_fp',
+                                     'run_identifier', 'output_dir', 'job_id',
+                                     'lane_number', 'is_restart']


Why does lane_number need to be present twice?

It was a mistake; maybe we found the culprit of the duplicated lane?!
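
For reference, a minimal sketch of how a repeated entry like this could be caught automatically. `find_duplicates` is a hypothetical helper introduced here for illustration; it is not part of qp_klp, and the list is copied from the hunk above.

```python
from collections import Counter


def find_duplicates(names):
    """Return the names that appear more than once, in first-seen order.

    Hypothetical helper for illustration only.
    """
    counts = Counter(names)
    return [name for name, n in counts.items() if n > 1]


# List as it appears in the diff hunk under review.
mandatory_attributes = ['qclient', 'uif_path',
                        'lane_number', 'config_fp',
                        'run_identifier', 'output_dir', 'job_id',
                        'lane_number', 'is_restart']

duplicates = find_duplicates(mandatory_attributes)

# dict.fromkeys keeps the first occurrence of each name and preserves order.
deduplicated = list(dict.fromkeys(mandatory_attributes))
```

A check along these lines could live in a unit test so that a repeated attribute name fails fast instead of surfacing later.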

AmandaBirmingham · 2025-09-10T16:49:11Z

src/qp_klp/Protocol.py

+        return failed_samples
+
+    def generate_sequence_counts(self):
+        # for other isntances of generate_sequence_counts in other objects


typo: replace "isntances" with "instances"

AmandaBirmingham · 2025-09-10T16:54:16Z

src/qp_klp/scripts/pacbio_commands.py

+def generate_bam2fastq_commands(sample_list, run_folder, outdir, threads):
+    """Generates the bam2fastq commands"""
+    df = pd.read_csv(sample_list, sep='\t')
+    files = {f.split('.')[-2]: f


What is this split/slice grabbing from the file name?

Adding some comments about this.
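
To make the question concrete, here is a small sketch of what `f.split('.')[-2]` returns. The file names below are assumptions made up for illustration (only the run identifier `m11111_20250101_111111_s4` appears elsewhere in this PR); the real naming scheme of the run folder is not shown in this thread.

```python
# Hypothetical file names, used only to illustrate the indexing.
example_files = [
    'm11111_20250101_111111_s4.hifi_reads.bc3011.bam',
    'm11111_20250101_111111_s4.hifi_reads.bc3012.bam',
]

# split('.') breaks the name on every dot; [-2] is the token just before
# the final extension.
parts = example_files[0].split('.')
second_to_last = parts[-2]

# Same dict shape as in the diff: second-to-last token -> file name.
files = {f.split('.')[-2]: f for f in example_files}

# Caveat: a name with an extra suffix shifts the token that gets picked.
shifted = 'sample.bc3011.bam.pbi'.split('.')[-2]
```

With these assumed names the key is the barcode-like token, which is why a comment documenting the expected file name pattern helps.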

AmandaBirmingham · 2025-09-10T16:58:04Z

src/sequence_processing_pipeline/Commands.py

+                # add to bucket_size
+                bucket_size += r1_size + r2_size
+                current_size += r1_size
+


I understand we are hurrying, but can we add an issue to refactor this copy/paste addition in the future? There is so much shared functionality between the two branches of the if/else.

I'll create an issue pointing to what needs to be refactored.

AmandaBirmingham · 2025-09-10T17:02:24Z

src/sequence_processing_pipeline/Commands.py

+
+    for d in openfps.values():
+        for f in d.values():
+            f.close()


The amount of code duplication between this and the original demux command feels to me like a real invitation to future bugs. Assuming we do not have time to refactor this now, can we have an issue to remind us to do it in the future?

AmandaBirmingham · 2025-09-10T17:15:55Z

src/sequence_processing_pipeline/NuQCJob.py

+        if 'chemistry' in header:
+            chemistry = header['chemistry']
+        else:
+            chemistry = ''


I suggest replacing with

chemistry = header.get('chemistry', '')

which will have the same effect, be shorter, and remove the need to repeat 'chemistry' :)
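
A quick sketch showing that the two forms agree, assuming `header` behaves like a plain dict; the function names and the sample value are hypothetical and exist only for the comparison.

```python
def chemistry_with_branch(header):
    """Original form from the diff: explicit membership test."""
    if 'chemistry' in header:
        chemistry = header['chemistry']
    else:
        chemistry = ''
    return chemistry


def chemistry_with_get(header):
    """Suggested form: dict.get with a default value."""
    return header.get('chemistry', '')


# Hypothetical headers: one with the key, one without.
with_key = {'chemistry': 'Default'}
without_key = {'other': 'value'}

results_match = all(
    chemistry_with_branch(h) == chemistry_with_get(h)
    for h in (with_key, without_key)
)
```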

AmandaBirmingham · 2025-09-10T17:28:39Z

src/sequence_processing_pipeline/Pipeline.py

+                mtag = tree.getroot().tag.split('}')[0] + '}'
+                instrument_text = (
+                    f'{mtag}ExperimentContainer/{mtag}Runs/{mtag}Run')
+                run = ET.parse(run_info).find(instrument_text)


I am really uncomfy with this being repeated. Could we make a method something like the below

def get_pacbio_run_str(run_info, run_directory):
    pacbio_metadata_fps = glob(
        f'{run_directory}/*/metadata/*.metadata.xml')
    if not pacbio_metadata_fps:
        raise ValueError(f"'{run_info}' doesn't exist")
    run_info = pacbio_metadata_fps[0]
    tree = ET.parse(run_info)
    mtag = tree.getroot().tag.split('}')[0] + '}'
    instrument_text = (
        f'{mtag}ExperimentContainer/{mtag}Runs/{mtag}Run')
    run = ET.parse(run_info).find(instrument_text)
    return run

and then call it in _get_instrument_id like

if not exists(run_info):
    run = get_pacbio_run_str(run_info, run_directory)
    text = run.attrib['TimeStampedName']

and in _get_instrument_type like

if not exists(run_info):
    run = get_pacbio_run_str(run_info, run_directory)
    date_string = run.attrib['CreatedAt'].split('T')[0]

Thank you; this made me realize that we could use the same method to replace get_date_from_run_id as well.
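
A self-contained sketch of the namespace handling used in the hunk, run against an inline XML string instead of a file on disk. The XML document, its namespace URI, and the attribute values are assumptions invented for illustration, and `find_run` is a hypothetical name; only the element path and the attribute names `TimeStampedName` and `CreatedAt` come from this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed metadata document (not a real PacBio file).
EXAMPLE_XML = """<PacBioDataModel xmlns="http://example.org/hypothetical-ns">
  <ExperimentContainer>
    <Runs>
      <Run TimeStampedName="r11111_20250101_111111"
           CreatedAt="2025-01-01T11:11:11Z"/>
    </Runs>
  </ExperimentContainer>
</PacBioDataModel>
"""


def find_run(root):
    """Locate the Run element, reusing the namespace of the root tag."""
    # A namespaced tag looks like '{uri}Name'; keep the '{uri}' prefix.
    mtag = root.tag.split('}')[0] + '}'
    path = f'{mtag}ExperimentContainer/{mtag}Runs/{mtag}Run'
    run = root.find(path)
    if run is None:
        raise ValueError('no Run element found')
    return run


root = ET.fromstring(EXAMPLE_XML)
run = find_run(root)

# The two lookups that the callers in the review need.
instrument_text = run.attrib['TimeStampedName']
date_string = run.attrib['CreatedAt'].split('T')[0]
```

Parsing once and passing the root around, as sketched here, also avoids reading the same file twice.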

AmandaBirmingham · 2025-09-10T17:35:05Z

tests/data/sample-sheets/metagenomic/pacbio/good_pacbio_metagv10.csv

+Description,,,,,,,
+,,,,,,,
+[Data],,,,,,,
+Sample_ID,Sample_Name,Sample_Plate,Sample_Well,Sample_Project,Well_description,Lane,barcode_id


BTW, I don't know if you are round-tripping sample sheets at any point, but if you read this in and then write it back out, barcode_id will show up between Sample_Well and Sample_Project, not at the end, due to the forced ordering of the columns.

Good to know.
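
A rough sketch of the round-trip effect described above, using only the standard library. `FORCED_ORDER` is a hypothetical column ordering chosen to match the comment (barcode_id between Sample_Well and Sample_Project); it is not taken from the actual sample sheet code, and the data row is made up.

```python
import csv
import io

# Header as written in the test sample sheet.
header = ['Sample_ID', 'Sample_Name', 'Sample_Plate', 'Sample_Well',
          'Sample_Project', 'Well_description', 'Lane', 'barcode_id']

# Hypothetical forced ordering, assumed only to mirror the comment above.
FORCED_ORDER = ['Sample_ID', 'Sample_Name', 'Sample_Plate', 'Sample_Well',
                'barcode_id', 'Sample_Project', 'Well_description', 'Lane']


def round_trip(text):
    """Read CSV text and write it back with the forced column order."""
    rows = list(csv.DictReader(io.StringIO(text)))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FORCED_ORDER,
                            lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Made-up data row for illustration.
original = ','.join(header) + '\n' + 's1,s1,p1,A1,proj,desc,1,bc3011\n'
written = round_trip(original)
written_header = written.splitlines()[0].split(',')
```

Tests that compare a written sheet against the original text would need to account for this reordering.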

AmandaBirmingham · 2025-09-10T17:35:39Z

tests/test_PacBioWorflow.py

+            makedirs(f'{genprep_dir}/{sp}/filtered_sequences/',
+                     exist_ok=True)
+
+        # # then loop over samples and stage all fastq.gz files


AmandaBirmingham · 2025-09-10T17:43:39Z

src/sequence_processing_pipeline/templates/nuqc_job_single.sh

+demux-runner
+echo "$(date) :: demux stop"
+
+touch ${OUTPUT}/${SLURM_JOB_NAME}.${SLURM_ARRAY_TASK_ID}.completed


This seems almost the same as nuqc_job.sh, with just a few deletions. Is this something we could consider refactoring in the future?

antgonza

Thank you @AmandaBirmingham for your review. Added this issue: #144.

antgonza · 2025-09-10T18:46:02Z

src/qp_klp/PacBioMetagenomicWorkflow.py

+        self.mandatory_attributes = ['qclient', 'uif_path',
+                                     'lane_number', 'config_fp',
+                                     'run_identifier', 'output_dir', 'job_id',
+                                     'lane_number', 'is_restart']


It was a mistake; maybe we found the culprit of the duplicated lane?!

antgonza · 2025-09-10T18:46:26Z

src/qp_klp/Protocol.py

+        return failed_samples
+
+    def generate_sequence_counts(self):
+        # for other isntances of generate_sequence_counts in other objects


antgonza · 2025-09-10T18:51:42Z

src/qp_klp/scripts/pacbio_commands.py

+def generate_bam2fastq_commands(sample_list, run_folder, outdir, threads):
+    """Generates the bam2fastq commands"""
+    df = pd.read_csv(sample_list, sep='\t')
+    files = {f.split('.')[-2]: f


Adding some comments about this.

antgonza · 2025-09-10T18:54:50Z

src/sequence_processing_pipeline/Commands.py

+                # add to bucket_size
+                bucket_size += r1_size + r2_size
+                current_size += r1_size
+


I'll create an issue pointing to what needs to be refactored.

antgonza · 2025-09-10T18:57:30Z

src/sequence_processing_pipeline/ConvertJob.py

+
+    def run(self, callback=None):
+        """
+        Run BCL2Fastq/BCLConvert conversion


Good catch, updating.

antgonza · 2025-09-10T19:01:34Z

src/sequence_processing_pipeline/NuQCJob.py

@@ -52,13 +53,15 @@ def __init__(self, fastq_root_dir, output_path, sample_sheet_path,
        :param additional_fastq_tags: A list of fastq tags to preserve during
        filtering.
        :param files_regex: the FILES_REGEX to use for parsing files
+        :param read_length: the FILES_REGEX to use for parsing files


Thank you for catching all these.

antgonza · 2025-09-10T19:07:57Z

tests/data/sample-sheets/metagenomic/pacbio/good_pacbio_metagv10.csv

+Description,,,,,,,
+,,,,,,,
+[Data],,,,,,,
+Sample_ID,Sample_Name,Sample_Plate,Sample_Well,Sample_Project,Well_description,Lane,barcode_id


Good to know.

antgonza · 2025-09-10T19:43:57Z

src/sequence_processing_pipeline/Pipeline.py

+                mtag = tree.getroot().tag.split('}')[0] + '}'
+                instrument_text = (
+                    f'{mtag}ExperimentContainer/{mtag}Runs/{mtag}Run')
+                run = ET.parse(run_info).find(instrument_text)


Thank you; this made me realize that we could use the same method to replace get_date_from_run_id as well.

antgonza · 2025-09-10T22:46:30Z

Thank you for all the help @AmandaBirmingham

antgonza added 3 commits August 27, 2025 10:50

initial changes

5d65b48

Merge branch 'main' of https://github.com/qiita-spots/qp-knight-lab-processing into pacbio

844e4bd

init push

55ab210

antgonza changed the title ~~PacBio~~ [WIP] PacBio Sep 1, 2025

antgonza added 3 commits September 2, 2025 06:56

fix InstrumentUtils

4439472

rm test_required_file_checks

6bf3586

m11111_20250101_111111_s4

411a119

antgonza commented Sep 2, 2025


antgonza added 22 commits September 2, 2025 16:09

add conda

e8a2d0b

pacbio_generate_bam2fastq_commands

5e16ab1

self.node_count -> self.nprocs

827a372

generate_sequence_counts

551bfd1

init changes to def _inject_data(self, wf):

6311723

read_length

40afadf

pmls_extra_parameters

52f4c53

rm index in _inject_data

e18cf5d

dstats

0734a6e

nuqc_job_single.sh

281bf75

rm extra ,

1aa102c

print Profile selected

01c862d

counts.txt in _inject_data

05036b7

demux_just_fwd

d5adee0

demux_single -? demux_just_fwd

2fcfc28

add cli for demux_just_fwd

27f49d1

fix demux_just_fwd params

dea4fd9

rm demux_just_fwd_processing splitter_binary

e596ade

self.files_regex = long_read

b27e7d6

sample_id_column

1666516

mv self.read_length = read_length up

0141440

_filter_empty_fastq_files

2b5700f

antgonza added 2 commits September 6, 2025 20:31

zip_longest

cc29d39

raw_reads_r1r2

3a94746

antgonza requested a review from AmandaBirmingham September 9, 2025 01:56

antgonza added 9 commits September 9, 2025 10:07

fastq.gz -> trimmed.fastq.gz

6e12dd5

barcode

b2a308e

barcode -> barcode_id

1da1abb

S000

e600bfb

del raw_reverse_seqs

d73b6a7

test filenames

c2b0a3e

S000_L001_R1_001.counts.txt

b2b4d60

{rec}{sid}

0051f49

rm touch for gz files

1418cbc

AmandaBirmingham requested changes Sep 10, 2025


antgonza added 2 commits September 10, 2025 14:18

add_default_workflow

62e388e

fixing counts

27efe2d

antgonza commented Sep 10, 2025


fix TestPipeline.test_generate_sample_information_files

a323f46

antgonza changed the title ~~[WIP] PacBio~~ PacBio Sep 10, 2025

AmandaBirmingham self-requested a review September 10, 2025 21:06

AmandaBirmingham approved these changes Sep 10, 2025


restart changes

48010d3

antgonza merged commit 7f33624 into qiita-spots:main Sep 10, 2025
2 checks passed
